Automatic Detection of Character Encoding and Language

Authors

  • Seungbeom Kim
  • Jongsoo Park
Abstract

The Internet is full of textual content in various languages and character encodings, and its communication across linguistic borders is ever increasing. Since different encodings are not compatible with one another, communication protocols such as HTTP, HTML, and MIME are equipped with mechanisms to tag the character encoding (a.k.a. charset) used in the delivered content. However, as native speakers of a language whose character set does not fit in US-ASCII, we have encountered many web pages and e-mail messages that are encoded in or labeled with the wrong character encoding, which is often annoying or frustrating. Many authors, especially of e-mail messages, are not aware of encoding issues, and their carelessness in choosing correct encodings can result in messages that are illegible to the recipient. Therefore, automatically detecting the correct character encoding of a given text can serve many people using various character encodings, including their native one, and has good practical value. The importance of automatic charset detection is not restricted to web browsers or e-mail clients; detecting the charset is the first step of text processing. Therefore, many text processing applications should have automatic charset detection as a crucial component; web crawlers are a good example. Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla and Internet Explorer. These implementations are very accurate and fast, but they apply a great deal of domain-specific knowledge on a case-by-case basis. In contrast, we aimed at a simple algorithm that can be applied uniformly to every charset and that is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based and character-based algorithms. We used Naïve Bayes (NB) and Support Vector Machines (SVM). Using documents downloaded from Wikipedia [6], we evaluated different combinations of algorithms and compared them with the universal charset detector in Mozilla. We found two promising algorithms. The first is a simple SVM whose features are the frequencies of byte values. The algorithm is uniform and very easy to implement; it needs at most 256 table entries per charset, and its detection time is much shorter than that of the other algorithms. Despite its simplicity, it achieves 98.22% accuracy, which is comparable to that of Mozilla (99.46%). The second is a character-based NB with an accuracy of 99.39%. It needs a larger table and a longer detection time than the first algorithm, but it also detects the language of the document as a byproduct.
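For illustration, the sketch below shows how the first algorithm (the byte-frequency SVM) could be realized with scikit-learn. It is a hypothetical reconstruction, not the authors' code, and the tiny inline corpus only stands in for the Wikipedia training data used in the paper.

```python
# Hypothetical sketch of the byte-based detector described above: each raw
# document is reduced to a 256-bin histogram of byte values, and a linear SVM
# maps that histogram to a charset label. Not the authors' implementation;
# the toy corpus below merely stands in for the Wikipedia training data.
import numpy as np
from sklearn.svm import LinearSVC

def byte_histogram(doc: bytes) -> np.ndarray:
    """Normalized frequency of each of the 256 possible byte values."""
    counts = np.bincount(np.frombuffer(doc, dtype=np.uint8), minlength=256)
    return counts / max(counts.sum(), 1)

# Toy training data: the same short texts encoded in different charsets.
samples = [
    ("문자 인코딩 자동 탐지", "euc-kr"),
    ("문자 인코딩 자동 탐지", "utf-8"),
    ("détection automatique de l'encodage", "cp1252"),
    ("détection automatique de l'encodage", "utf-8"),
]
X = np.vstack([byte_histogram(text.encode(cs)) for text, cs in samples])
y = [cs for _, cs in samples]

model = LinearSVC().fit(X, y)   # one weight vector of at most 256 entries per charset

def detect_charset(doc: bytes) -> str:
    """Predict the charset of an unseen raw document."""
    return model.predict(byte_histogram(doc).reshape(1, -1))[0]

print(detect_charset("인코딩 탐지 예제".encode("euc-kr")))   # expected: 'euc-kr'
```

The character-based Naïve Bayes variant works analogously, but over characters obtained by decoding the bytes, which is what allows it to identify the language of the document as a byproduct.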


Similar resources

A New Approach for Automatic Chinese Spelling Correction

This article presents a new approach for automatic Chinese spelling error detection and correction. Existing Chinese spelling checking systems have two problems: (1) a low precision rate, and (2) a lack of correction capability. The proposed Chinese spelling correction method is composed of two mechanisms: (1) composite confusing character substitution, and (2) advanced word class bigram language mo...


chared: Character Encoding Detection with a Known Language

chared is a system which can detect the character encoding of a text document, provided the language of the document is known. The system supports a wide range of languages and the most commonly used character encodings. We explain the details of the algorithm, describe the process of creating models for various languages, and present the results of an evaluation on a collection of Web pages.
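The general recipe in this known-language setting (sketched below for illustration only; this is not chared's code or API) is to trial-decode the raw bytes with each candidate encoding and keep the candidate whose decoded characters score best under a character model of the known language.

```python
# Illustrative sketch of encoding detection when the language is known
# (the general idea only -- not chared's implementation or API):
# trial-decode the bytes with each candidate encoding and pick the one whose
# decoded characters score highest under a character model of that language.
import math
from collections import Counter

def char_model(training_text: str) -> dict:
    """Unigram log-probabilities of characters in the known language."""
    counts = Counter(training_text)
    total = sum(counts.values())
    return {ch: math.log(n / total) for ch, n in counts.items()}

def detect_encoding(raw: bytes, candidates: list, model: dict) -> str:
    floor = math.log(1e-6)                      # penalty for unseen characters
    best, best_score = None, float("-inf")
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:              # impossible encoding: skip it
            continue
        score = sum(model.get(ch, floor) for ch in text) / max(len(text), 1)
        if score > best_score:
            best, best_score = enc, score
    return best

# Toy usage: a Czech character model and a Czech sentence encoded in cp1250.
model = char_model("Příliš žluťoučký kůň úpěl ďábelské ódy. ")
raw = "Žluťoučký kůň pěl ódy".encode("cp1250")
print(detect_encoding(raw, ["utf-8", "cp1250", "iso-8859-2", "latin-1"], model))
# expected: 'cp1250'
```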


Automatic Language Identification Using Multivariate Analysis

Identifying the language of an e-text is complicated by the existence of a number of character sets for a single language. We present a language identification system that uses Multivariate Analysis (MVA) for dimensionality reduction and classification. We compare its performance with existing schemes, viz. N-grams and compression.


Determination of the Script and Language Content of Document Images

Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly availab...


Language Identification: The Long and the Short of the Matter

Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of what models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce th...



Journal:

Volume:   Issue:

Pages:

Publication date: 2007